Tuomas Keski-Kuha 6.11.2021
Here’s a link to my IODS-project
Hi!
Nice to be on the Intro course on Open Data Science. I work in the risk management field in the insurance business. I studied applied mathematics and statistics at the University of Helsinki some 10 years ago. I expect to learn the basics of open data science. I found this course on the MOOC courses page.
Looking forward to learning more about R; my goal is to learn more statistics with R so that I am able to use it in my work, and perhaps also as a hobby! :D So I'm starting with happy feelings, but in a bit of a hurry to finish the first week's task.
Tuomas Keski-Kuha 13.11.2021
The data come from a study whose aim was to examine different variables affecting students' grades. It is a questionnaire-based study.
# first let's read the data:
learning2014 <- read.csv(file = "data/learning2014.csv",
stringsAsFactors = TRUE)
# let's study the data structure etc.
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
## 6 F 38 38 4.750000 3.625 2.416667 21
In the data we have 166 observations (students) and 7 variables. Some of the variables describe the students (gender, age, attitude), while deep, stra and surf summarize the students' questionnaire answers. Points are the exam points and serve as the outcome variable.
Now let’s dig deeper into the data and get an overview of it:
# the ggplot2 library is already installed, so let's load it
library(ggplot2)
#summarise data
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
# let's also use library GGally
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# then let's draw a plot where also the possible dependencies can be seen
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
The summary plot gives a good feel for the data. Gender determines the coloring, so all the information can be compared between genders. The plot also shows the distributions of all the variables.
Here are a few remarks on the distributions:
Here are a few remarks on the correlations:
Next, let’s apply a linear model to the data. The response or dependent variable (the variable thought to show the effect) here is points, and all the others are predictors or independent variables (the variables thought to cause the effect).
Let’s choose the three most relevant independent variables by looking at the largest correlations with the target variable (points). Attitude, stra and surf have the three largest absolute correlations, as the quick check below illustrates.
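As a quick check (this snippet is an addition, not part of the original output), the correlations of the numeric variables with points can be computed directly:
# added check: correlations of the numeric variables with the exam points
cor(learning2014[, c("age", "attitude", "deep", "stra", "surf")],
    learning2014$points)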
# Let's create a linear regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
We fitted the linear model y = a0 + a1*x1 + a2*x2 + a3*x3, i.e. points ≈ 11.02 + 0.34*attitude + 0.85*stra - 0.59*surf. Looking at the results:
Of course the model will fit differently if we drop one of the non-significant variables, but here it seems that the small a2 and a3 coefficients would not change the model that much, and those extra parameters could even worsen the model's prediction capability. Still, let's fit a linear model with two variables (attitude and stra):
# Let's create a linear regression model with multiple independent variable
my_model_2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model_2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
Dropping the surf variable did not change the fit much, but at least the intercept a0 got a better fit. Still, the overall results aren't that impressive without surf either, if you look at the residual standard error and the residual values. So it seems that the best thing here would be to drop the stra variable as well (usually the simpler, the better). The residuals appear to have the same kind of distribution as in the previous model.
The multiple R-squared value (ranging from 0 to 1) gives the proportion of the variance in the response variable (the effect) that is explained by the predictor variables (the cause). A low value indicates that the variance in the response variable (points) is not explained well by the predictors (attitude and stra). Here the R-squared value is indeed not impressive, and a lot of the variability is left unexplained by the predictors.
Also, the adjusted R-squared did not change much between the first and the second model; its value is about 0.2. Adjusted R-squared is handier than multiple R-squared because it takes into account the number of variables in the model and adjusts the multiple R-squared accordingly (R-squared always increases as you add more variables). So with adjusted R-squared one can better compare models in terms of the explained variance in the response variable, as the small check below illustrates.
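As a small illustrative check (added here; values are taken from the model summary above), the adjusted R-squared can be recomputed by hand from the multiple R-squared:
# added check: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n_obs  <- nrow(learning2014)   # 166 observations
r2     <- 0.2048               # multiple R-squared of my_model_2
p_vars <- 2                    # predictors: attitude and stra
1 - (1 - r2) * (n_obs - 1) / (n_obs - p_vars - 1)   # approximately 0.195, as in the summary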
For the fun of it, let’s also try fitting a simple line (y = a0 + a1*x1), where x1 is attitude.
# Let's create a linear regression model with just one independent variable
my_model_3 <- lm(points ~ attitude, data = learning2014)
summary(my_model_3)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
We observe that the residual standard error became a bit larger even though the fitting of the parameters was more successful. The adjusted R-squared also became lower, which is not good, but is the difference here significant? I would guess not, if you compare the residual standard errors (model vs. actual values) with those of the previous models with multiple predictive variables.
Let’s first draw a simple regression line on the data, with attitude as the predictive variable:
# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points))
# define the visualization type (points)
p2 <- p1 + geom_point()
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# draw the plot
p3
## `geom_smooth()` using formula 'y ~ x'
Let’s draw diagnostic plots and choose the following plots:
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
plot(my_model_3, which = c(1, 2, 5))
Commenting the plots:
Tuomas Keski-Kuha 21.11.2021
# access a few libraries:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
# first let's read the data:
alc<- read.csv(file = "https://github.com/rsund/IODS-project/raw/master/data/alc.csv",
stringsAsFactors = FALSE, sep = ",")
# let's study the data structure etc.
glimpse(alc)
## Rows: 370
## Columns: 51
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",~
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",~
## $ age <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,~
## $ address <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U",~
## $ famsize <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "LE~
## $ Pstatus <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T",~
## $ Medu <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3,~
## $ Fedu <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2,~
## $ Mjob <chr> "at_home", "other", "at_home", "services", "services", "ser~
## $ Fjob <chr> "other", "other", "other", "health", "services", "health", ~
## $ reason <chr> "home", "reputation", "reputation", "course", "reputation",~
## $ guardian <chr> "mother", "mother", "mother", "mother", "other", "mother", ~
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1,~
## $ studytime <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2,~
## $ schoolsup <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no",~
## $ famsup <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye~
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no", ~
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no"~
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye~
## $ internet <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes~
## $ romantic <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "n~
## $ famrel <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3,~
## $ freetime <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3,~
## $ goout <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2,~
## $ Dalc <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,~
## $ Walc <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1,~
## $ health <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3,~
## $ n <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,~
## $ id.p <int> 1096, 1073, 1040, 1025, 1166, 1039, 1131, 1069, 1070, 1106,~
## $ id.m <int> 2096, 2073, 2040, 2025, 2153, 2039, 2131, 2069, 2070, 2106,~
## $ failures <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,~
## $ paid <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no~
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1~
## $ G1 <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12,~
## $ G2 <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 14~
## $ G3 <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14,~
## $ failures.p <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,~
## $ paid.p <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no~
## $ absences.p <int> 4, 2, 8, 2, 2, 2, 0, 0, 6, 10, 0, 6, 2, 0, 6, 0, 0, 4, 4, 2~
## $ G1.p <int> 13, 13, 14, 10, 13, 11, 10, 11, 15, 10, 15, 11, 13, 12, 13,~
## $ G2.p <int> 13, 11, 13, 11, 13, 12, 11, 10, 15, 10, 14, 12, 12, 12, 12,~
## $ G3.p <int> 13, 11, 12, 10, 13, 12, 12, 11, 15, 10, 15, 13, 12, 12, 13,~
## $ failures.m <int> 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,~
## $ paid.m <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "yes", "no",~
## $ absences.m <int> 2, 2, 8, 2, 8, 2, 0, 2, 12, 10, 0, 0, 2, 0, 2, 2, 4, 8, 0, ~
## $ G1.m <int> 7, 8, 14, 10, 10, 12, 12, 8, 16, 10, 14, 8, 9, 8, 10, 16, 1~
## $ G2.m <int> 10, 6, 13, 9, 10, 12, 0, 9, 16, 11, 15, 0, 10, 11, 11, 15, ~
## $ G3.m <int> 10, 5, 13, 8, 10, 11, 0, 8, 16, 11, 15, 0, 10, 11, 11, 15, ~
## $ alc_use <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5,~
## $ high_use <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,~
## $ cid <int> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010,~
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "n" "id.p" "id.m"
## [31] "failures" "paid" "absences" "G1" "G2"
## [36] "G3" "failures.p" "paid.p" "absences.p" "G1.p"
## [41] "G2.p" "G3.p" "failures.m" "paid.m" "absences.m"
## [46] "G1.m" "G2.m" "G3.m" "alc_use" "high_use"
## [51] "cid"
Data set information (straight from the web page):
This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
Still, we are going to use the data to predict a binary variable for alcohol consumption (high_use).
The four variables expected to affect alcohol consumption are as follows (the variable picking was biased as I had already seen the results in the DataCamp exercise :D). Anyway, let’s pick sex, goout, absences and failures.
Let’s draw some plots and tables if we could see whether there might be any relationship between target variable (high_use) and predictors.
# produce summary statistics by group for studying relationship between high_use and sex
alc %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex high_use count
## <chr> <lgl> <int>
## 1 F FALSE 154
## 2 F TRUE 41
## 3 M FALSE 105
## 4 M TRUE 70
# let's draw a plot to study how absences might relate to the target variable
g1 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g1 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")
# initialize a plot of high_use related to gout
g2 <- ggplot(data = alc, aes(x = goout, fill = high_use))
# define the plot as a bar plot and draw it
g2 + geom_bar() + ggtitle("Student go-out-tendency by high alcohol consumption")
# let's count few crosstables for examining high_use relatedness to failures
table(high_use = alc$high_use, failures = alc$failures)
## failures
## high_use 0 1 2 3
## FALSE 238 12 8 1
## TRUE 87 12 9 3
In this chapter, let’s apply a logistic regression model to predict high alcohol use. For the model we use the variables that seemed to have an effect on high alcohol use (sex, goout, absences, failures).
# find the model with glm() (target high_use)
m <- glm(high_use ~ sex + goout + absences + failures, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + goout + absences + failures, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1521 -0.7929 -0.5317 0.7412 2.3706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14389 0.47881 -8.654 < 2e-16 ***
## sexM 0.97989 0.26156 3.746 0.000179 ***
## goout 0.69801 0.12093 5.772 7.83e-09 ***
## absences 0.08246 0.02266 3.639 0.000274 ***
## failures 0.48932 0.23073 2.121 0.033938 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.50 on 365 degrees of freedom
## AIC: 379.5
##
## Number of Fisher Scoring iterations: 4
The model summary seems to tell that the predictive variables fit well overall and are statistically significant (the others have p-values < 0.001, but for failures the p-value is 0.034, which is less significant, so it is reasonable to suspect it adds little predictive power). Also, we cannot rule out strong correlations between the predictors, so it could be wise to study those as well, as sketched below.
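A minimal sketch of such a check (added here, not part of the original analysis), looking only at the numeric predictors of the model:
# added check: pairwise correlations between the numeric predictors
cor(alc[, c("goout", "absences", "failures")])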
Let’s also present the coefficients as odds ratios, together with their confidence intervals:
# compute odds ratios (OR) round the results
OR <- coef(m) %>% exp %>% round(2)
# compute confidence intervals (CI), round the results
CI <- confint(m) %>% exp %>% round(2)
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02 0.01 0.04
## sexM 2.66 1.61 4.49
## goout 2.01 1.59 2.56
## absences 1.09 1.04 1.14
## failures 1.63 1.04 2.59
I did not have time to adjust for the intercept term, but luckily its odds ratio is near zero, so it won’t matter that much in this case. The sex and goout odds ratios are clearly bigger than one, which indicates that they are relevant for the probability of high alcohol use. Failures and absences also have odds ratios above one, but only slightly, so they do not seem as relevant for the probability of high alcohol use. Their confidence intervals also suggest that they are less significant.
Let’s use the logistic regression model to predict high alcohol use from the selected data. We’ll calculate the probability of high alcohol use and then form a new variable which is TRUE if the probability is bigger than 0.5 and FALSE otherwise. After that we make a cross table of the model predictions versus the actual high_use variable (the observed values), and also a graphical presentation of the predictions vs. the observations. In the end we also evaluate the prediction error of the model.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 243 16
## TRUE 61 50
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
The model gives sensible results. We can see that there are 61 + 16 = 77 wrong predictions vs. 243 + 50 = 293 correct ones.
Let’s also define a loss function and use it to calculate the inaccuracy of the model. The same value can be computed from the figures in the previous paragraph (incorrect cases divided by all cases).
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
round(loss_func(class = alc$high_use, prob = alc$probability), 4)
## [1] 0.2081
So the model is correct on average for 79 % of the cases and incorrect for 21 % of the cases.
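The same figure can be cross-checked by hand from the cross table above (a small added check):
# added check: incorrect cases divided by all cases
(16 + 61) / (243 + 16 + 61 + 50)   # = 77 / 370, approximately 0.2081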
It seems that it’s hard to predict high use (the model gets it wrong 61 times vs. 50 correct), but for low use it’s far more precise (16 wrong vs. 243 correct). If you used simple guessing (a coin flip as the model), it would be correct in roughly 50 % of the cases, so the model gives reasonable accuracy.
Let’s make a 10 fold cross-validation for validating the logistic regression model we used above:
# set the random seed so that the cross-validation results can be reproduced
set.seed(123)
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross-validation, this is not adjusted value, but a raw inaccuracy rate in the cross-validation
round(cv$delta[1], 4)
## [1] 0.2081
So cross-validation gives a figure really close to the prediction error we got when using all the data for training.
# Let's define the wanted predictor sets (just using the column numbers):
# the first test set has many predictors (almost all of them, more than 40)
testi_1 <- c(1:24, 27, 29:30, 31, 33:48, 50)
# we drop a few in the testi_2 predictor set, but it still has quite many (approx. 30)
testi_2 <- c(2:24, 29:31, 33:38, 50)
# we drop some more, still keeping the most relevant (approx 20)
testi_3 <- c(2:11, 12, 17, 24, 29:30, 31, 33:35, 50)
# now we keep just 10
testi_4 <- c(2:5, 24, 27, 30, 31, 33, 50)
# the last predictor set has only 3, number 50 is high_use (target variable)
testi_5 <- c(2, 24, 33, 50)
# build a list of the different predictor sets
testi_setti <- list(testi_1, testi_2, testi_3,
testi_4, testi_5)
# initialising the result matrix
tulokset <- matrix(nrow = 2, ncol = 5)
# define a loop which takes one testi_setti item (a predictor vector) at a time and calculates the training error on the whole data set and the test error from 10-fold cross-validation (the code inside is reused from above)
for (i in 1:5) {
pred_var <- testi_setti[[i]]
# find the model with glm() (target high_use)
m2 <- glm(high_use~., data = alc[pred_var], family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# call loss_func to compute the average number of wrong predictions in the (training) data
train_err <- round(loss_func(class = alc$high_use, prob = alc$probability), 4)
set.seed(123)
# cross-validation
cv <- cv.glm(data = alc[pred_var], cost = loss_func, glmfit = m2, K = 10)
# here we'll have results for cross-validation
test_err <- round(cv$delta[1], 4)
# saving the results to tulokset matrix
tulokset[1,i] <- train_err
tulokset[2,i] <- test_err
}
tulokset
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1757 0.1946 0.2000 0.2162 0.2108
## [2,] 0.2649 0.2757 0.2432 0.2243 0.2081
In the tulokset matrix the first row contains the training errors on the whole data and the second row the test errors from cross-validation. The column 1 model has almost all the predictors one can pick, and we decrease the number of predictors (keeping the most relevant ones) until the fifth column, which has only three.
It can be noticed that the training error tends to go up when there are fewer predictors in the model. With more predictors the model is more complex and learns the training data better, but on the other hand its test error is large. So a model with many predictors becomes dependent on the data it was trained on and does not work that well on the different cross-validation sets. One could say that a simpler model with significant predictors is far more robust and passes the testing better. The sketch below visualizes this trade-off.
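As a small added sketch (not part of the original analysis), the training and cross-validation errors stored in tulokset could be plotted against the predictor set index:
# added sketch: training vs. cross-validation error by predictor set
# (set 1 has the most predictors, set 5 the fewest)
err_df <- data.frame(set = rep(1:5, 2),
                     error = c(tulokset[1, ], tulokset[2, ]),
                     type = rep(c("training", "cross-validation"), each = 5))
ggplot(err_df, aes(x = set, y = error, col = type)) + geom_line() + geom_point()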
Tuomas Keski-Kuha 28.11.2021
Exercise 2: Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here.
Let’s see the Boston data in the following:
# access a few libraries:
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.2
## corrplot 0.92 loaded
library(GGally)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Data set information
The data describes housing values in the suburbs of Boston. It has 506 rows and 14 columns, with information on the suburbs (towns) of Boston. Here is the link to the data page, where one can see the full details.
Exercise 3: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
Let’s visualize the data and take a look of the distributions of the variables and relations between variables.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# First lets make a long table from Boston data
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
# Then plot every variable, so we'll see the distributions of variables
ggplot(data = melt.boston, aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
There is significant variability in the data, as some variables are heavily skewed towards the minimum or maximum value (most of their values are near or exactly at the minimum or maximum). There are also a few integer variables in the data, which do not fit the density plots above very well. A few of the variables seem approximately normally distributed.
It is clear that there are really strong correlations (positive and negative) between variables in the data, as the correlation plot shows. In the plot, the colour and the size of the circles indicate the strength and sign of the correlations.
Quite interestingly, the rad (accessibility to radial highways) and tax (property tax rate) variables have the biggest positive correlations with the crime rate in towns. Some might expect that variables like dis (distance to employment centres) and lstat (lower status of the population, percent) would correlate more with the crime rate. In the plot we can also see that rad and tax are heavily correlated with each other, so they behave almost like one variable; the quick check below illustrates this.
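A quick added check of this claim (this snippet is not part of the original output): the correlations of all variables with crim can be listed directly.
# added check: correlations of all variables with the crime rate, sorted
sort(cor(Boston)[, "crim"], decreasing = TRUE)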
We can also see that there seem to be significant variables like:
All in all, it is hard to gain much further insight from this correlation analysis alone.
Exercise 4: Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
Let’s scale the variables so that the mean = 0 and variance = 1 for all variables in the dataset:
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
We can see that after scaling each variable has mean 0 and standard deviation 1 (z = (x - mean) / sd), so the values now spread around zero with both negative and positive values.
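A small added check of the standardization (not in the original output):
# added check: after scaling, each column has mean 0 and standard deviation 1
round(colMeans(boston_scaled), 2)
round(apply(boston_scaled, 2, sd), 2)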
# perform principal component analysis (with the SVD method)
pca_boston <- prcomp(boston_scaled)
# define summaries so that we can compute real values of PC1 and PC2 (used for biplots) for both analysis made std and non std
s <- summary(pca_boston)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw biplots of the principal component representation and the original variables both for non standardized data and standardized
biplot(pca_boston, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
In the following we create a new categorical variable from the scaled crim variable using its quantiles, so that the new variable has the labels low, med_low, med_high and high, and each category contains 25 % of the data.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
After that we make training and test sets from the data, so that we pick randomly 80 % of the rows to the training set and the rest to the test set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
set.seed(123)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Exercise 5: Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
Let’s fit the LDA model to training data and visualize results.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2500000 0.2500000 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 1.01866313 -0.9066422 -0.08120770 -0.8835724 0.38666334 -0.9213895
## med_low -0.07141687 -0.3429155 0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591 0.2162741 0.19544522 0.3637157 0.12480815 0.4564260
## high -0.48724019 1.0149946 -0.03371693 1.0481437 -0.47733231 0.8274496
## dis rad tax ptratio black lstat
## low 0.9094324 -0.6819317 -0.7458486 -0.4234280 0.3729083 -0.766883775
## med_low 0.3694282 -0.5395408 -0.4205644 -0.1079710 0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959 0.1049654 0.009720124
## high -0.8601246 1.6596029 1.5294129 0.8057784 -0.6383964 0.900379309
## medv
## low 0.47009410
## med_low 0.01548761
## med_high 0.19440747
## high -0.71294711
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.148390645 0.74870532 -1.0874785
## indus 0.040971465 -0.38126652 -0.1619456
## chas 0.002460776 0.03963849 0.1699312
## nox 0.312458150 -0.67564471 -1.3104018
## rm 0.011086947 -0.16058718 -0.1572603
## age 0.283482486 -0.38817624 -0.1013491
## dis -0.079848789 -0.38493775 0.2108038
## rad 3.718978412 0.93123532 -0.4706522
## tax -0.015197127 0.04230505 1.2889075
## ptratio 0.180294774 -0.01592588 -0.3558490
## black -0.136724112 0.02948075 0.1288959
## lstat 0.145739238 -0.37823065 0.3345688
## medv 0.061327205 -0.53906503 -0.1509890
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9523 0.0364 0.0113
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
The linear discriminant analysis seems to fit the data quite well, as the plot indicates. The rad variable (index of accessibility to radial highways) and also zn (proportion of residential land zoned for lots over 25,000 sq. ft.) seem to have the biggest effect on the classification. The rad variable gives really good discrimination for the high-crime towns; for the lower-crime towns the separation is not clear at all. In the picture rad clearly separates the high-crime towns from the rest.
Why then does rad (index of accessibility to radial highways) have such a significant effect? It is not obvious that good highway connections would affect the crime rate that much while other variables (like lower-status population, industrialization or distance from employment centres) would not. Perhaps there is a higher population density near highways, or simply more people living near them. In any case, rad seems to be a good indicator of a high crime rate.
Exercise 6: Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.
In the next phase we save the crime categories from the test set. We then make predictions on the test data and see how accurate the model is:
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 10 4 0
## med_low 2 17 6 0
## med_high 1 9 13 2
## high 0 0 1 27
The model’s predictive power seems to be quite good. There are just a few bigger inaccuracies, where the correct class is two classes away from the predicted one. Still, most results end up near the diagonal. Especially the high class seems to be “easy” to predict, as the accuracy check below also shows.
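As a small addition, the overall share of correctly predicted classes can be computed directly from the objects above:
# added check: overall classification accuracy on the test set
mean(correct_classes == lda.pred$class)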
Exercise 7: Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
# Load the data again and after that standardize all variables (mean = 0 and variance = 1)
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
Then, we calculate the distances and apply K-means algorithm to make clusters:
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
Finding the optimal value for the number of clusters:
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal value is the value where the WCSS (within cluster sum of squares) drops radically (here the optimal number of clusters for K-means seems to be 2).
# k-means clustering when k = 2
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston scaled dataset with clusters
pairs(boston_scaled, col = km$cluster)
We can see that the crim, indus, rad, tax and black variables seem to separate the two clusters best, so let’s look at plots with only those variables.
# let's make a vector containing the most clearly separating variables
a_1 <- c(1, 3, 9, 10, 12)
# plot the Boston scaled dataset with clusters
pairs(boston_scaled[, a_1], col = km$cluster)
The presented variables seem to have the largest effect on the clustering (into two clusters), as the plot shows good separation between the black and pink groups.
Next, we draw few plot more where one can see these significant variable values by clusters:
# let's draw few plots more where significant variable values are presented in a boxplot by clusters
par(mfrow=c(1,3))
boxplot(boston_scaled$rad ~ km$cluster, main = "Rad by clusters")
boxplot(boston_scaled$tax ~ km$cluster, main = "Tax by clusters")
boxplot(boston_scaled$black ~ km$cluster, main = "Black by clusters")
boxplot(boston_scaled$indus ~ km$cluster, main = "Indus by clusters")
boxplot(boston_scaled$crim ~ km$cluster, main = "Crim by clusters")
Perhaps one does not become much wiser by looking at the boxplots, because many of the picked clustering variables take only a few distinct values within a cluster (e.g. crim, indus, tax), so one can criticize the separating power of any single variable here. But at least we got some more nice pictures.
Bonus exercise: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?
# Load the data again and after that standardize all variables (mean = 0 and variance = 1)
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# k-means clustering
km <- kmeans(boston_scaled, centers = 5)
Let’s add the clusters to the data and then apply linear discriminant analysis using the clusters we got from K-means clustering as the classes:
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, km$cluster)
set.seed(123)
# linear discriminant analysis
lda.fit_2 <- lda(km.cluster ~ ., data = boston_scaled)
# plot the lda results
plot(lda.fit_2, dimen = 2, col = boston_scaled$km.cluster, pch = boston_scaled$km.cluster)
lda.arrows(lda.fit_2, myscale = 1)
It seems clear that the chas and rad variables are the most influential linear separators for the five clusters. Rad we handled before; it seems that highway accessibility makes a difference and separates towns. Chas indicates that the riverside towns are significantly different from the non-riverside towns.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~train$crime)
# k-means clustering
km_2 <- kmeans(matrix_product, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~km_2$cluster)
I am not really sure what the intention was with the latter 3D plot. I applied K-means (k = 4, the same number of classes as in the first plot) to the matrix product data (the LDA projection), intending to cluster the projected observations. This latter K-means clustering makes the data look really smoothly separated: there are no obvious outliers or data points “inside” another group as in the first plot. Now that I look at the second plot, it looks much like what you would get if you simply plotted the predictions of the LDA model.
Tuomas Keski-Kuha 2.12.2021
Exercise 1: Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
The human data originates from the United Nations Development Programme. See their data page for more information. For a nice overview see also the calculating the human development indices pdf.
In the data we have countries and their rankings based on the Human Development Index (HDI) and the Gender Inequality Index (GII). Both are aggregate indexes constructed from other measures. In addition to these indexes there are several other variables. In the analysis below we mainly use the variables explained here. A few words about the main indexes:
In the following we explore the data with only limited number of variables (numerical variables only).
# access a few libraries just in case:
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(GGally)
library(reshape2)
library(plotly)
# read the human data
#human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep =",", header = T)
# set the working directory to iods project folder
setwd("c:/Tuomas/Opiskelu/Open Data Science/IODS-project")
# read human from data wrangling exercise output
human <- read.csv(file = "data/human.csv",
stringsAsFactors = FALSE)
# add countries as rownames
rownames(human) <- human$X
# remove the first variable which is the country
human <- human[, -1]
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# look at the structure of human
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# print out summaries of the variables
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# let's draw ggpairs -plot to see an overview of the data
ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot
We see from the plots that most of the variables are quite nicely (roughly normally) distributed, but there are exceptions like GNI (gross national income per capita) and Mat.Mor (maternal mortality ratio). This indicates that wealth and health differences between countries are very significant.
There are very strong negative and positive correlations between the variables. A couple of the correlations seem to have straightforward explanations (like life expectancy vs. expected years of schooling, and life expectancy vs. maternal mortality). Let’s note the following:
Exercise 2: Perform principal component analysis (PCA) on the not standardized human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.
Principal Component Analysis (PCA) can be performed by two slightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).
Both methods quite literally decompose a data matrix into a product of smaller matrices, which lets us extract the underlying principal components. This makes it possible to approximate a lower-dimensional representation of the data by choosing only a few principal components.
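As a small added illustration (not part of the original analysis) of the SVD connection, one can verify that the loadings returned by prcomp() are the right singular vectors of the centered data matrix:
# added sketch: prcomp() is based on the SVD of the centered data X = U D V'
X  <- scale(as.matrix(human), center = TRUE, scale = FALSE)
sv <- svd(X)
# the principal component loadings equal the columns of V (up to sign)
all.equal(abs(unname(prcomp(human)$rotation)), abs(sv$v))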
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Here the results are dominated by GNI, which has very large absolute variability: because the data were not standardized, the GNI values are on a completely different scale than the other variables. Standardization seems like a good plan.
Exercise 3: Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomenons they relate to.
# standardize the variables
human_std <- scale(human)
# perform principal component analysis to the std. data (with the SVD method)
pca_human_std <- prcomp(human_std)
# define summaries so that we can compute real values of PC1 and PC2 (used for biplots) for both analysis made std and non std
s <- summary(pca_human)
s_std <- summary(pca_human_std)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
pca_pr_std <- round(100*s_std$importance[2,], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
pc_lab_std <- paste0(names(pca_pr_std), " (", pca_pr_std, "%)")
# draw biplots of the principal component representation and the original variables both for non standardized data and standardized
biplot(pca_human, main = "PCA on non-standardized data", cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
biplot(pca_human_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab_std[1], ylab = pc_lab_std[2], main = "PCA on standardized data")
Standardizing seems to make the PCA model more effective, and the arrows for the variables now have clear directions. The first PCA model on the non-standardized data had essentially only one variable loading on PC1, namely GNI, which alone explained practically 100 % of the variability in the country data. So the latter plot is completely different from the previous one, both in the arrows and in the shares of variance of PC1 and PC2 (PC1 explains 53.6 % and PC2 16.2 % of the variability in the data). Also, the countries are far more evenly spread on the plot; in the non-standardized analysis most of the countries were at the top right, and what the model effectively did was separate the rich countries from the rest and place some very poor countries at the lower right.
The standardization can also be seen nicely in the lengths of the arrows (which reflect the standard deviations of the variables): in the latter plot the standard deviation is one for all variables, so all the arrows have almost the same length.
It’s hard to say what is not different between these two plots. The GNI arrow can be seen in both, and some of the same very rich countries stand out in both plots (like the traditional oil countries Qatar, Kuwait and the United Arab Emirates).
Exercise 4: Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.
It looks like health, wealth and knowledge differences between countries are captured by PC1 (variables like GNI, Life.Exp and Mat.Mor). The rest of the variability in the data is explained by gender-based factors like female labour force participation (Labo.FM) and women in parliament (Parli.F). The independence between these gender-based variables and the others can easily be seen in the plot above, as the arrows are approximately orthogonal (roughly 90 degrees between them).
Also, the correlations between the “PC1” variables (positive and negative) are really well presented here (arrows pointing in the same direction or exactly opposite); these variables correlate strongly with each other and behave as a group with respect to the variability between countries.
In the plot, rich countries are on the left and poor countries on the right. At the top are countries that are more equal between the sexes (Rwanda is an interesting outlier; I don’t know whether women are simply more involved in the labour force there, as it is quite an underdeveloped country due to its dark history. Perhaps agriculture, where women are also heavily involved, plays a larger role), and the countries at the bottom are more unequal.
Exercise 5: Load the tea dataset from the package Factominer. Explore the data briefly: look at the structure and the dimensions of the data and visualize it. Then do Multiple Correspondence Analysis on the tea data (or to a certain columns of the data, it’s up to you). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots.
Multiple Correspondence Analysis (MCA) is a method for analyzing qualitative data, and it is an extension of Correspondence Analysis (CA). MCA can be used to detect patterns or structure in the data as well as for dimension reduction. In other words, it tries to pack the information in the categorical variables into a few dimensions, so that we can in a way group individual observations.
The data used here concern a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), what their perception of the product is (12 questions) and some personal details (4 questions).
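Before exploring the data below, here is a minimal added sketch of what the MCA call itself looks like in R (the column selection is only an example and not necessarily the one used in the actual analysis):
# illustrative sketch only: MCA on an example subset of the tea columns
library(FactoMineR)
data(tea)
tea_example <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca_example <- MCA(tea_example, graph = FALSE)
# eigenvalues / shares of variance of the MCA dimensions
head(mca_example$eig)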
# access a FactoMineR package and also factoextra:
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.2
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
##read tea-data
data("tea")
# look at the structure of tea
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# visualize the tea data in three plots
gather(tea[, 1:12]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
## Warning: attributes are not identical across measure variables;
## they will be dropped
gather(tea[, 13:24]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
## Warning: attributes are not identical across measure variables;
## they will be dropped
gather(tea[, 25:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
## Warning: attributes are not identical across measure variables;
## they will be dropped
We can see that all but one variable (age) are categorical, so simply counting the cases per category in the plots is a suitable way to visualize them.
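As a quick sanity check (a small side sketch, not part of the original analysis), the variable types can also be counted programmatically:
# count the column classes of the tea data: everything is a factor except the integer age
table(sapply(tea, class))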
In the following sections we do two MCAs (this is not exactly what was asked). There was quite a lot of ready-made and handy R code for the plots available on the web, and I used it without hesitation in the MCA analysis.
In the MCA that follows, the first 18 questions in the tea data are the active variables, the 19th (age) is a supplementary quantitative variable, and the remaining variables are supplementary categorical variables. Supplementary variables are not actually used in fitting the MCA, so applying "all" variables to the MCA effectively means using just the active variables (the first 18).
# let's run MCA first on all of the categorical variables and then decide which variables we use in further analysis (the most effective variables can be seen in the variables representation for the first two MCA dimensions). Don't use graphs = FALSE here, so we get the graphs from the analysis
res.mca <- MCA(tea, quanti.sup = 19, quali.sup = 20:36)
## Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Active variables are shown in red and supplementary variables in green in the first plot above. The second MCA factor map plots the individuals (the whole data) on MCA dimensions 1 and 2. In factor maps, the distance between any two row points (individuals) or column points (variable values) gives a measure of their similarity (or dissimilarity): row points with similar profiles are close to each other on the factor map, and the same holds for column points.
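If one wants to redraw just the individuals factor map separately, factoextra offers a helper for that (a small sketch reusing the res.mca object fitted above; individuals close to each other have similar answer profiles):
# redraw the individuals factor map from the MCA above using factoextra
fviz_mca_ind(res.mca, geom = "point", ggtheme = theme_minimal())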
Comments on factor maps of two first dimensions:
The variables representation gives the correlations of the variables with Dim 1 and Dim 2: where, how and price are the most relevant for the first two dimensions.
There could be further variable values that separate individuals in higher dimensions as well, but these first dimensions explain more of the variation in the data. We can also look at the share of the variation that each dimension retains and see it in the plot:
# To visualize the percentages of inertia (variation) explained by each MCA dimensions, use the function fviz_eig() or fviz_screeplot() [factoextra package]:
fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 11))
In this plot we can see that the first two dimensions are more relevant than the others, but they explain only ~18 % of the variation.
We can also calculate the contributions of the categories (variable values) to the first three dimensions and plot the whole variable set in a heatmap-like plot where the contributions can easily be seen.
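The exact figures behind the scree plot can also be extracted as a table with factoextra (a sketch using the res.mca object from above):
# eigenvalues, percentage of variance and cumulative percentage per MCA dimension
eig.val <- get_eigenvalue(res.mca)
head(eig.val)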
# Let's also calculate the contributions of the categories to the first three dimensions
# Contributions of variables to Dim 1
fviz_contrib(res.mca, choice = "var", axes = 1, top = 10)
# Contributions of variables to Dim 2
fviz_contrib(res.mca, choice = "var", axes = 2, top = 10)
# Contributions of variables to Dim 3
fviz_contrib(res.mca, choice = "var", axes = 3, top = 10)
# Heat-map for contributions to the first two dimensions
fviz_mca_var(res.mca, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # avoid text overlapping (slow)
ggtheme = theme_minimal()
)
## Warning: ggrepel: 26 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Let’s pick the most significant variables from the variable categories plot for dimensions 1 and 2 for the next MCA analysis. Those near the origin are not significant for the first two dimensions. Let’s also pick some variables which are significant for dimension 3.
Perhaps the whole analysis (if one wants to focus on the first two dimensions) would be just as good with only three variables (price, where, how), but let’s also keep tearoom, friends, resto, Tea and How in the second analysis below.
# We pick the most relevant variables from the analysis on the whole data and do the MCA again with those variables
keep_columns <- c("where", "tearoom", "friends", "how", "price", "Tea", "resto", "How")
# select the 'keep_columns' to create a new dataset
tea_time <- tea[, keep_columns]
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# multiple correspondence analysis on the tea_time dataset (not the whole data)
mca <- MCA(tea_time, graph = TRUE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = TRUE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.289 0.255 0.168 0.150 0.134 0.132 0.128
## % of var. 13.598 12.005 7.893 7.039 6.327 6.227 6.005
## Cumulative % of var. 13.598 25.603 33.497 40.536 46.863 53.091 59.096
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.124 0.111 0.105 0.100 0.095 0.085 0.078
## % of var. 5.840 5.204 4.931 4.723 4.488 4.013 3.671
## Cumulative % of var. 64.935 70.140 75.070 79.793 84.282 88.295 91.966
## Dim.15 Dim.16 Dim.17
## Variance 0.066 0.063 0.041
## % of var. 3.122 2.970 1.942
## Cumulative % of var. 95.088 98.058 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.629 0.456 0.101 | 0.079 0.008 0.002 | 0.910
## 2 | -0.387 0.173 0.098 | 0.001 0.000 0.000 | 0.593
## 3 | -0.109 0.014 0.012 | -0.457 0.273 0.218 | -0.137
## 4 | -0.459 0.243 0.256 | -0.036 0.002 0.002 | -0.186
## 5 | -0.459 0.243 0.256 | -0.036 0.002 0.002 | -0.186
## 6 | -0.731 0.616 0.235 | 0.089 0.010 0.003 | 0.082
## 7 | -0.277 0.088 0.117 | -0.205 0.055 0.064 | -0.369
## 8 | -0.387 0.173 0.098 | 0.001 0.000 0.000 | 0.593
## 9 | 0.376 0.163 0.069 | 0.162 0.034 0.013 | 0.155
## 10 | 0.807 0.751 0.264 | 0.291 0.111 0.034 | 0.330
## ctr cos2
## 1 1.645 0.211 |
## 2 0.698 0.228 |
## 3 0.038 0.020 |
## 4 0.069 0.042 |
## 5 0.069 0.042 |
## 6 0.013 0.003 |
## 7 0.270 0.208 |
## 8 0.698 0.228 |
## 9 0.048 0.012 |
## 10 0.217 0.044 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## chain store | -0.590 9.653 0.620 -13.614 | -0.139 0.610
## chain store+tea shop | 1.114 13.954 0.436 11.417 | -0.519 3.434
## tea shop | 0.883 3.373 0.087 5.090 | 2.243 24.645
## Not.tearoom | -0.298 3.094 0.370 -10.517 | 0.073 0.210
## tearoom | 1.242 12.910 0.370 10.517 | -0.304 0.876
## friends | 0.272 2.085 0.139 6.448 | -0.236 1.788
## Not.friends | -0.512 3.930 0.139 -6.448 | 0.445 3.371
## tea bag | -0.608 9.073 0.484 -12.030 | -0.156 0.678
## tea bag+unpackaged | 0.761 7.854 0.264 8.892 | -0.410 2.586
## unpackaged | 0.885 4.069 0.107 5.653 | 1.810 19.257
## cos2 v.test Dim.3 ctr cos2 v.test
## chain store 0.035 -3.216 | 0.076 0.277 0.010 1.758 |
## chain store+tea shop 0.095 -5.322 | -0.146 0.411 0.007 -1.493 |
## tea shop 0.559 12.927 | -0.109 0.089 0.001 -0.630 |
## Not.tearoom 0.022 2.574 | -0.052 0.165 0.011 -1.850 |
## tearoom 0.022 -2.574 | 0.219 0.688 0.011 1.850 |
## friends 0.105 -5.611 | -0.207 2.088 0.081 -4.915 |
## Not.friends 0.105 5.611 | 0.390 3.934 0.081 4.915 |
## tea bag 0.032 -3.091 | 0.160 1.083 0.034 3.167 |
## tea bag+unpackaged 0.077 -4.794 | -0.375 3.282 0.064 -4.379 |
## unpackaged 0.447 11.556 | 0.223 0.444 0.007 1.422 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## where | 0.624 0.586 0.010 |
## tearoom | 0.370 0.022 0.011 |
## friends | 0.139 0.105 0.081 |
## how | 0.485 0.460 0.065 |
## price | 0.421 0.414 0.254 |
## Tea | 0.035 0.174 0.428 |
## resto | 0.101 0.202 0.111 |
## How | 0.136 0.078 0.382 |
From the Categorical variables (eta2) part of the summary we can see the squared correlations between each variable and the dimensions; a value close to one indicates a strong link between the variable and the dimension. Here where, how and price stand out as the most important variables for the first two dimensions.
These are largely the same variables that stood out in the previous MCA model, where all variables were used.
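If I recall the FactoMineR result structure correctly, the same squared correlations can also be pulled directly from the model object (a sketch using the mca object fitted above):
# squared correlations (eta^2) between the variables and the first three dimensions
round(mca$var$eta2[, 1:3], 3)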
In the next plot we see the percentage of the variance in the data that each dimension retains in the analysis.
# To visualize the percentages of inertia (variation) explained by each MCA dimensions, use the function fviz_eig() or fviz_screeplot() [factoextra package]:
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 15))
In this plot we can see that the first two dimensions are more relevant than the others; still, they explain only ~25 % of the variation.
Let’s plot the results on the factor map, where the variables are shown in different colours. In the factor map we can see how similar (small distance) or dissimilar (large distance) the categories are. In the second factor map we can also see the contributions to the first two dimensions:
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
# heat map for contributions:
fviz_mca_var(mca, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # avoid text overlapping (slow)
ggtheme = theme_minimal()
)
Comments:
#let's look at How-variable values:
summary(tea_time$How)
## alone lemon milk other
## 195 33 63 9
All in all, I think that for the first two dimensions three groups can be seen in the end (the discrimination is fairly straightforward, though some might challenge these conclusions).
We can also draw an extra plot showing the factor maps for some key variables. The plot draws confidence ellipses for the variable values (the individuals and their answers to these questions can also be seen here), which indicate the statistical significance of the differences between variable values. If the ellipses cross, there is no statistically significant difference between the variable values; on the other hand, if the ellipses are far from each other, the variable values really do differ statistically.
fviz_ellipses(mca, c("where", "how", "price", "How"),
geom = "point")
A few comments:
Tuomas Keski-Kuha 12.12.2021
# access libraries:
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
library(GGally)
library(reshape2)
library(plotly)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.4 v stringr 1.4.0
## v readr 2.1.0 v forcats 0.5.1
## v purrr 0.3.4
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks plotly::select(), dplyr::select()
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.1.2
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:stats':
##
## filter
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.2
library(lme4)
## Warning: package 'lme4' was built under R version 4.1.2
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
Exercise 1: Implement the analyses of Chapter 8 of MABS using the RATS data. (0-7 points: 0-4 points for graphs or analysis results + 0-3 points for their interpretations)
Graphical displays of data are almost always useful for exposing patterns in the data, particularly when these are unexpected; this might be of great help in suggesting which class of models might be most sensibly applied in the later more formal analysis.
In the following we have the RATS data, which contains rat weight development under different diets. In the data we have three groups of rats that were put on different diets, and each rat’s weight (16 rats in total) in grams was measured repeatedly over a 9-week period.
The goal of the study was to determine whether these diets produce different growth responses in the rats. Let’s first load the data (in long form) and draw a plot where all the information can be seen.
# load the data
# set the working directory to iods project folder
setwd("c:/Tuomas/Opiskelu/Open Data Science/IODS-project")
RATSL <- read.csv(file = "data/RATSL.csv")
RATSL <- within(RATSL, {
ID <- factor(ID)
Group <- factor(Group)
})
# read original RATS data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep ="\t", header = T)
# changing categorical variables to factors
RATS <- within(RATS, {
ID <- factor(ID)
Group <- factor(Group)
})
# draw a plot of the rat data
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
theme(legend.position = "top")
Comments:
Let’s standardize all the rat weights in the next part and let’s see if there could be found more variance between the groups.
# Standardise the scores for RATS
RATSL_std <- RATSL %>%
group_by(Time) %>%
mutate( stdrats = (Weight - mean(Weight))/sd(Weight) ) %>% ungroup()
glimpse(RATSL_std)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3~
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,~
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",~
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47~
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,~
## $ stdrats <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, -0~
p1 <- ggplot(RATSL_std, aes(x = Time, y = stdrats, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(name = "standardized rats")
p6
Remarks:
In the next chapter we calculate summary measures for each group and compare differences between groups.
In the first plot we draw the group mean weights with error bars of one standard error of the mean (mean +/- se).
# Number of measurement time points, baseline (day 1) included
n <- RATSL$Time %>% unique() %>% length()
# Make a summary data:
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean=mean(Weight), se=sd(Weight)/sqrt(n)) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
p1 <- ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group))
p2 <- p1 + geom_line() + scale_linetype_manual(values = c(1,2,3))
p3 <- p2 + geom_point(size=3) + scale_shape_manual(values = c(1,2,3))
p4 <- p3 + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=1.5)
p5 <- p4 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6 <- p5 + theme(legend.position = c(0.9,0.45))
p7 <- p6 + scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
p7
In the plot above:
Next we’ll focus on the growth patterns of the rats. Let’s calculate the growth percentage for every rat at every measuring point and check whether there’s anything interesting there. After calculating the growth percentages, we simply take each rat’s mean growth percentage (over all the measuring points), compare the groups, and plot them in a boxplot.
# Let's use the wide data in which it's perhaps easier to calculate the growth percents
RAT_g <- RATS
# we need a loop over the columns (time points) to calculate the growth percentages for all the rats at all time points
for (i in 4:13) {
RAT_g[, i] = (RATS[, i]-RATS[, i-1])/RATS[, i-1]*100
}
# let's drop the baseline (first weight)
RAT_g <- RAT_g[, -3]
# modify RAT_g to long data
RAT_gL <- gather(RAT_g, key = WD, value = grow, -ID, -Group) %>%
mutate(Time = as.integer(substr(WD,3,4)))
# Number of weeks, baseline (week 0) included
n <- RAT_gL$Time %>% unique() %>% length()
# Make a summary data:
#RAT_gS <- RAT_gL %>%
# group_by(Group, Time) %>%
# summarise( mean=mean(grow), se=sd(grow)/sqrt(n)) %>%
# ungroup()
# let's calculate the means of the growth percents for all individual rats
RAT_gS1 <- RAT_gL %>%
group_by(Group, ID) %>%
summarise( mean=mean(grow) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# plot the mean growth percents
ggplot(RAT_gS1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
scale_y_continuous(name = "mean(growth percent)")
This plot shows perhaps something interesting in the data, though of course the variability in group 1 looks smaller partly because there are more rats in that group. Here we perhaps see that the mean of group 2 differs from the other groups if we focus only on the growth pattern, but it’s hard to make strong statements based on this alone. One could also compare the growth percentages at each time point separately.
Let’s also look at some statistical analysis of the summary measure data. We take each rat’s baseline weight into a linear model to check how relevant a factor it is. We’ll also fit a linear model without the baseline as a predictor to see the difference.
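To put a rough number on that impression, one could run a one-way ANOVA on the summary measure (a sketch using the RAT_gS1 data created above; with such small groups the p-value should be read with caution):
# one-way ANOVA: does the mean growth percentage differ between the diet groups?
growth_aov <- aov(mean ~ Group, data = RAT_gS1)
summary(growth_aov)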
# Create a summary data by group and id with mean as the summary variable (ignoring baseline week 0).
RATSL8S <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Add the baseline from the original data as a new variable to the summary data
RATSL8S2 <- RATSL8S %>%
mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit_1 <- lm(mean ~ Group, data = RATSL8S2)
summary(fit_1)
##
## Call:
## lm(formula = mean ~ Group, data = RATSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.886 -27.031 2.284 10.588 105.750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 263.72 12.96 20.346 3.06e-11 ***
## Group2 220.99 22.45 9.844 2.16e-07 ***
## Group3 262.08 22.45 11.674 2.91e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.66 on 13 degrees of freedom
## Multiple R-squared: 0.9313, Adjusted R-squared: 0.9207
## F-statistic: 88.07 on 2 and 13 DF, p-value: 2.763e-08
# Fit the linear model with the mean as the response with baseline as predictor
fit_2 <- lm(mean ~ baseline + Group, data = RATSL8S2)
summary(fit_2)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.732 -3.812 1.991 6.889 13.455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.14886 19.88779 1.516 0.1554
## baseline 0.93194 0.07793 11.959 5.02e-08 ***
## Group2 31.68866 17.11189 1.852 0.0888 .
## Group3 21.52296 21.13931 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared: 0.9947, Adjusted R-squared: 0.9933
## F-statistic: 747.8 on 3 and 12 DF, p-value: 6.636e-14
# Compute the analysis of variance for the fitted models with anova()
anova(fit_2, fit_1)
## Analysis of Variance Table
##
## Model 1: mean ~ baseline + Group
## Model 2: mean ~ Group
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 12 1352.4
## 2 13 17471.4 -1 -16119 143.02 5.023e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From these analyses we can see that the baseline weight is a highly significant predictor of the mean weight, and conditional on the baseline the group (diet) effects are no longer clearly significant; the ANOVA comparison also shows that adding the baseline improves the model fit significantly.
Exercise 2: Implement the analyses of Chapter 9 of MABS using the BPRS data. (0-8 points: 0-4 points for graphs or analysis results + 0-4 points for their interpretations)
In this chapter we’ll do several analyses which, as the title (Linear Mixed Effects Models) suggests, are all linear regression models, but with some added parameters and features. The aim is to have a better model for repeated-measures data than basic linear regression, which assumes that all observations are independent. Here are the four models that we’ll apply (in brackets we shorten the model name): a simple linear regression model (LRM), a random intercept model (RIM), a random intercept and random slope model (SLO), and a random intercept and random slope model with a week × treatment interaction (INT).
The data we use is the BPRS data: 40 male subjects were randomly assigned to one of two treatment groups, and each subject was rated on the brief psychiatric rating scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
Let’s load the BPRS data, take a glimpse, and draw a plot of it.
# load the data
# set the working directory to iods project folder
setwd("c:/Tuomas/Opiskelu/Open Data Science/IODS-project")
BPRSL <- read.csv(file = "data/BPRSL.csv")
BPRSL <- within(BPRSL, {
treatment <- factor(treatment)
subject <- factor(subject)
})
# Glimpse the data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0~
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, ~
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# Draw the plot
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
Comments on the data:
First, we try a simple linear regression model (LRM) on the BPRS data with bprs as the response and week and treatment as explanatory variables. Here we ignore the repeated-measures structure of the data and assume that the observations are all independent, even though the same subjects are measured multiple times and those observations tend to be correlated.
The form of the LRM with two explanatory variables (week (time) \(T\) and treatment \(X\)) is \[ Y = \beta_0 + \beta_1 T + \beta_2 X + \epsilon. \] Here the residual term (the difference between an observed and a predicted value, i.e. the model error) is a normally distributed random variable \(\epsilon\) with zero mean and variance \(\sigma^2\).
# create a regression model BPRS_reg
BPRS_LRM <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_LRM)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
plot(BPRS_LRM, 1)
Commentary:
The previous model assumes independence of the repeated measures of bprs, and this assumption is highly unlikely to hold. So now we try something more suitable for this repeated-measures type of study.
To begin the more formal analysis of the BPRS data, we will first fit the random intercept model (RIM) with the same two explanatory variables: week and treatment. Fitting a random intercept model allows the linear regression fit for each subject to differ in intercept from the other subjects. So in this model it’s possible to have different symptom-development profiles.
The form of the RIM for subject \(i\) at time \(t_j\), where \(j\) represents the week, is \[ y_{ij} = (\beta_0 + u_i) + \beta_1 t_j + \beta_2 x_i + \epsilon_{ij}. \] Here the variance of each repeated measurement is \({Var}(y_{ij}) = {Var}(u_i + \epsilon_{ij}) = \sigma_u^2 + \sigma^2\) (independent of time), where the random effects \(u_i\) are assumed to be normally distributed with mean zero and variance \(\sigma_u^2\), independently of the residuals \(\epsilon_{ij}\).
The RIM in a way splits the single random error of the LRM into an individual random effect, which depends on the subject, and the basic residual error.
So let’s fit this model next:
# Create a random intercept model
BPRS_RIM <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRS_RIM)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
Notes:
Now we can move on to fit the random intercept and random slope model (SLO) to the data. Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope. This way it is possible to account for the individual differences in the bprs development profiles and also the effect of time.
The form of the SLO for subject \(i\) at time \(t_j\), where \(j\) represents the week, is \[ y_{ij} = (\beta_0 + u_i) + (\beta_1+v_i) t_j + \beta_2 x_i + \epsilon_{ij}. \] Here the variance of each repeated measurement is \[ {Var}(y_{ij}) = {Var}(u_i + v_it_j +\epsilon_{ij}) = \sigma_u^2+2\sigma_{uv}t_j+\sigma_v^2t_j^2+\sigma^2, \] where the random effects \(u_i\) and \(v_i\) have variances \(\sigma_u^2\) and \(\sigma_v^2\) and covariance \(\sigma_{uv}\).
Let’s note here that this model allows correlation between residuals (at different time points) of the same individual.
# create a random intercept and random slope model
BPRS_SLO <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_SLO)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
Commenting:
Finally, we can fit a random intercept and slope model that allows for a treatment × time interaction.
The form of the INT for subject \(i\) at time \(t_j\), where \(j\) represents the week, is \[ y_{ij} = (\beta_0 + u_i) + (\beta_1+v_i) t_j + \beta_2 x_i + \beta_3 x_it_j + \epsilon_{ij}. \]
# create a random intercept and random slope model with week x treatment interaction
BPRS_INT <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_INT)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.9840 16.047
## week -2.6283 0.3752 -7.006
## treatment2 -2.2911 4.2200 -0.543
## week:treatment2 0.7158 0.5306 1.349
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
Remarks:
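One further remark: although not shown above, the nested mixed models could be compared with likelihood-ratio tests (a sketch reusing the fitted objects from the previous chunks; anova() for lmer fits reports chi-squared tests between the models):
# compare the random intercept model to the random intercept + slope model
anova(BPRS_RIM, BPRS_SLO)
# compare the random intercept + slope model to the model with the interaction
anova(BPRS_SLO, BPRS_INT)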
Next let’s draw plots for different model fits to compare.
# Create a vector of the fitted values
fitted_RIM <- fitted(BPRS_RIM)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
mutate(fitted_RIM)
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(linetype = treatment)) + ggtitle("Original observations") +
theme(legend.position = "top")
# Draw the plot
ggplot(BPRSL, aes(x = week, y = fitted_RIM, group = subject)) +
geom_line(aes(linetype = treatment)) + ggtitle("Random intercept fit") +
theme(legend.position = "top")
# Create a vector of the fitted values
fitted_SLO <- fitted(BPRS_SLO)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
mutate(fitted_SLO)
# draw the plot
ggplot(BPRSL, aes(x = week, y = fitted_SLO, group = subject)) +
geom_line(aes(linetype = treatment)) + ggtitle("Random intercept + slope fit") +
theme(legend.position = "top")
# Create a vector of the fitted values
fitted_INT <- fitted(BPRS_INT)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
mutate(fitted_INT)
# draw the plot
ggplot(BPRSL, aes(x = week, y = fitted_INT, group = subject)) +
geom_line(aes(linetype = treatment)) + ggtitle("Random intercept + slope + interaction fit") +
theme(legend.position = "top")
The random intercept with random slope model, with or without the interaction, is notably different from the plain random intercept model. Perhaps the random intercept with random slope is better at capturing those subjects whose symptoms are not decreasing.
It’s quite hard to compare against the actual data because it is so variable, and it looks like a linear model is hard to fit to the original data. The individual slopes seem to capture the effect that those with stronger symptoms at the start perhaps have more “room” to recover and get better than subjects who are doing better at the start.
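If one wanted a crude numeric comparison to go with the plots, the in-sample fit of each model could be summarised with an RMSE of the fitted values (a sketch using the fitted_* vectors created above; unlike AIC/BIC this ignores model complexity):
# root mean squared error of each model's fitted values against the observed bprs
rmse <- function(obs, fit) sqrt(mean((obs - fit)^2))
c(RIM = rmse(BPRSL$bprs, fitted_RIM),
  SLO = rmse(BPRSL$bprs, fitted_SLO),
  INT = rmse(BPRSL$bprs, fitted_INT))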
This week was really tough to manage and to find enough time for the exercises, but I managed to squeeze these in on time. I think I learned quite a bit about these models this week because it was a struggle; with better R skills it would have been a lot easier. The interpretations were really challenging this week, I think, and linear mixed effects models are not as easy to understand as they first seem.
It took me a lot of time to figure out that in the BPRS data the subjects were not unique individuals: the same subject number (e.g. subject = 1) refers to a different person in each treatment group. I figured this out only at the end. I still had fun in spite of the challenges.
Comments on the exercises and learning
This week was really fun and I took quite a few liberties doing the exercises (especially with the MCA part). Of course my understanding of the methods is limited, but there was helpful material on the web.